AITopics | data selection

Collaborating Authors

data selection

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Towards Free Data Selection with General-Purpose Models

Neural Information Processing SystemsApr-24-2026, 06:52:18 GMT

A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets. However, current approaches, represented by active learning methods, typically follow a cumbersome pipeline that iterates the time-consuming model training and batch data selection repeatedly. In this paper, we challenge this status quo by designing a distinct data selection pipeline that utilizes existing general-purpose models to select data from various datasets with a single-pass inference without the need for additional training or supervision. A novel free data selection (FreeSel) method is proposed following this new pipeline. Specifically, we define semantic patterns extracted from intermediate features of the general-purpose model to capture subtle local information in each image. We then enable the selection of all data samples in a single pass through distance-based sampling at the fine-grained semantic pattern level.

data mining, machine learning, selection, (15 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.95)
Information Technology > Data Science > Data Mining (0.94)

Add feedback

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Ye, Jiayuan, Feldman, Vitaly, Talwar, Kunal

arXiv.org Machine LearningApr-10-2026

Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g. a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110m parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.

large language model, machine learning, natural language, (18 more...)

arXiv.org Machine Learning

2604.08519

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Add feedback

Bridging the Gap Between Climate Science and Machine Learning in Climate Model Emulation

Schmidt, Luca, Effenberger, Nina

arXiv.org Machine LearningMar-25-2026

While climate models provide insights for climate decision-making, their use is constrained by significant computational and technical demands. Although machine learning (ML) emulators offer a way to bypass the high computational costs, their effective use remains challenging. The hurdles are diverse, ranging from limited accessibility and a lack of specialized knowledge to a general mistrust of ML methods that are perceived as insufficiently physical. Here, we introduce a framework to overcome these barriers by integrating both climate science and machine learning perspectives. We find that designing easy-to-adopt emulators that address a clearly defined task and demonstrating their reliability offers a promising path for bridging the gap between our two fields.

artificial intelligence, emulator, machine learning, (19 more...)

arXiv.org Machine Learning

2603.2232

Country:

North America > United States > Texas > Travis County > Austin (0.04)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)

Genre: Research Report (0.52)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)

Add feedback

MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

Neural Information Processing SystemsMar-22-2026, 10:13:09 GMT

Pretraining data selection has the potential to improve language model pretraining efficiency by utilizing higher-quality data from massive web data corpora. Current data selection methods, which rely on either hand-crafted rules or larger reference models, are conducted statically and do not capture the evolving data preferences during pretraining. In this paper, we introduce, where a data influence model continuously adapts to the evolving data preferences of the pretraining model and then selects the data most effective for the current pretraining progress. Specifically, we collect oracle data influence by locally probing the pretraining model and fine-tune a small data influence model to approximate it accurately. The data influence model then predicts data influence over the whole pretraining corpus and selects the most influential data for the next pretraining stage. Experiments of pretraining 410M and 1B models on the C4 dataset demonstrate that MATES significantly outperforms random data selection on extensive downstream tasks. It doubles the gains achieved by the state-of-the-art data selection approach that leverages larger reference models and reduces the total FLOPs required to reach certain performances by half. Further analyses validate the effectiveness of the locally probed oracle data influence and the approximation with data influence models. Our code is open-sourced at https://github.com/cxcscmu/MATES.

artificial intelligence, name change, proceedings, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.39)

Add feedback

SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models

Neural Information Processing SystemsMar-21-2026, 17:35:23 GMT

Despite the effectiveness of data selection for pretraining and instruction fine-tuninglarge language models (LLMs), improving data efficiency in supervised fine-tuning(SFT) for specialized domains poses significant challenges due to the complexityof fine-tuning data. To bridge this gap, we introduce an effective and scalabledata selection method for SFT, SmallToLarge (S2L), which trains a smallmodel, clusters loss trajectories of the examples, and samples from these clusters toguide data selection for larger models. We prove that during fine-tuning, sampleswithin the same loss trajectory cluster exhibit similar gradients. Then, we showthat S2L subsets have a bounded gradient error w.r.t. the full data, hence guaranteeconvergence to the neighborhood of the optimal solution. We demonstrate throughextensive experiments that S2L significantly improves data efficiency in SFT formathematical problem-solving, reducing the training data requirement to just $11$%of the original MathInstruct dataset to match full dataset performance whileoutperforming state-of-the-art data selection algorithms by an average of $4.7$%across $6$ in-and out-domain evaluation datasets. Remarkably, selecting only 50Kdata for SFT, S2L achieves a $32.7$% accuracy on the challenging MATHbenchmark, improving Phi-2 by $16.6$%. In clinical text summarization on theMIMIC-III dataset, S2L again outperforms training on the full dataset usingonly $50$% of the data. Notably, S2L can perform scalable data selection using areference model $100\times$ smaller than the target model, proportionally reducing thecomputational cost.

artificial intelligence, large language model, natural language, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.39)

Add feedback

CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning

Neural Information Processing SystemsMar-18-2026, 16:31:12 GMT

Data selection has emerged as a core issue for large-scale visual-language model pretaining (e.g., CLIP), particularly with noisy web-curated datasets. Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data than the original OpenAI CLIP model, and (3) designing better metrics or strategies universally applicable to any CLIP embedding without requiring specific model properties (e.g., CLIPScore is one popular metric). While the first two approaches have been extensively studied, the third remains under-explored. In this paper, we advance the third approach by proposing two new methods. Firstly, instead of classical CLIP scores that only consider the alignment between two modalities from a single sample, we introduce $\textbf{negCLIPLoss}$, a method inspired by CLIP training loss that adds the alignment between one sample and its contrastive pairs as an extra normalization term to CLIPScore for better quality measurement.

artificial intelligence, machine learning, natural language, (11 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.85)
Information Technology > Artificial Intelligence > Machine Learning (0.57)

Add feedback

TSDS: Data Selection for Task-Specific Model Finetuning

Neural Information Processing SystemsMar-18-2026, 07:19:17 GMT

Finetuning foundation models for specific tasks is an emerging paradigm in modern machine learning. The efficacy of task-specific finetuning largely depends on the selection of appropriate training data. We present TSDS (Task-Specific Data Selection), a framework to select data for task-specific model finetuning, guided by a small but representative set of examples from the target task. To do so, we formulate data selection for task-specific finetuning as an optimization problem with a distribution alignment loss based on optimal transport to capture the discrepancy between the selected data and the target distribution. In addition, we add a regularizer to encourage the diversity of the selected data and incorporate kernel density estimation into the regularizer to reduce the negative effects of near-duplicates among the candidate data.We connect our optimization problem to nearest neighbor search and design efficient algorithms to compute the optimal solution based on approximate nearest neighbor search techniques.We evaluate our method on data selection for both continued pretraining and instruction tuning of language models.We show that instruction tuning using data selected by our method with a 1\% selection ratio often outperforms using the full dataset and beats the baseline selection methods by 1.5 points in F1 score on average.

artificial intelligence, information retrieval, natural language, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.83)

Add feedback

ff80e644be415af4bcd7e4b4efb2152f-Paper-Conference.pdf

Neural Information Processing SystemsFeb-18-2026, 20:35:28 GMT

dataset, experiment, low-quality data, (13 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(4 more...)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Banking & Finance (1.00)
Information Technology > Security & Privacy (0.93)
Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Security & Privacy (0.93)

Add feedback

MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

Neural Information Processing SystemsFeb-18-2026, 00:42:18 GMT

Experiments of pretraining 410M and 1B models on the C4 dataset demonstrate that MA TES significantly outperforms random data selection on extensive downstream tasks. It doubles the gains achieved by the state-of-the-art data selection approach that leverages larger reference models and reduces the total FLOPs required to reach certain performances by half. Further analyses validate the effectiveness of the locally probed oracle data influence and the approximation with data influence models. Our code is open-sourced at https://github.com/cxcscmu/MA

large language model, machine learning, oracle data, (19 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Michigan (0.04)
Europe (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry: Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Add feedback

Filters

Collaborating Authors

data selection

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

cdd0640218a27e9e2c0e52e324e25db0-Paper-Conference.pdf

Towards Free Data Selection with General-Purpose Models

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Bridging the Gap Between Climate Science and Machine Learning in Climate Model Emulation

MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models

CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning

TSDS: Data Selection for Task-Specific Model Finetuning

ff80e644be415af4bcd7e4b4efb2152f-Paper-Conference.pdf

MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models